The characteristics of an organism are determined by the genes expressed within it. A 
method was developed, called serial analysis of gene expression (SAGE), that allows the 
quantitative and simultaneous analysis of a large number of transcripts. To demonstrate 
this strategy, short diagnostic sequence tags were isolated from pancreas, concatenated, 
and cloned. Manual sequencing of 1000 tags revealed a gene expression pattern 
characteristic of pancreatic function. New pancreatic transcripts corresponding to novel tags 
were identified. SAGE should provide a broadly applicable means for the quantitative 
cataloging and comparison of expressed genes in a variety of normal, developmental, and 
disease states.



Determination of the genomic sequence of 
higher organisms, including humans, is now 
a real and attainable goal. However, this 
analysis represents only one level of genetic 
complexity. The ordered and timely expres- 
sion of this information represents another 
level of complexity equally important to the 
definition and biology of the organism. 
Techniques based on complementary DNA 
(cDNA) subtraction or differential display 
can be quite useful for comparing gene expression
differences between two cell types
(1), but provide only a partial picture,
with no direct information about abun- 
dance. The expressed sequence tag (EST) 
approach is a valuable tool for gene dis- 
covery (2), but like RNA blotting, ribo- 
nuclease (RNase) protection, and reverse 
transcriptase-polymerase chain reaction 
(RT-PCR) analysis (3), it evaluates only a
limited number of genes at a time. Here 
we describe the serial analysis of gene 
expression (SAGE), a technique that al- 
lows a rapid, detailed analysis of thousands 
of transcripts. 

SAGE is based on two principles. First, a 
short nucleotide sequence tag [9 to 10 base 
pairs (bp)] contains sufficient information 
to uniquely identify a transcript, provided it 
is isolated from a defined position within 
the transcript. For example, a sequence as 
short as 9 bp can distinguish 262,144 transcripts
(4^9) given a random nucleotide distribution
at the tag site, whereas current
estimates suggest that even the human ge- 
nome encodes only about 80,000 transcripts 
(4). Second, concatenation of short sequence
tags allows the efficient analysis of
transcripts in a serial manner by the se- 
quencing of multiple tags within a single 
clone. As with serial communication by 
computers, wherein information is transmitted
as a continuous string of data, serial
analysis of the sequence tags requires a 
means to establish the register and bound- 
aries of each tag. 
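
The arithmetic behind the first principle can be checked with a short Python sketch. The function names are illustrative, and the 80,000-transcript figure is simply the estimate quoted above; the calculation assumes, as the text does, a random nucleotide distribution at the tag site.

```python
def tag_space(tag_length: int) -> int:
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** tag_length


def headroom(tag_length: int, n_transcripts: int) -> float:
    """Ratio of possible tags to transcripts (illustrative helper)."""
    return tag_space(tag_length) / n_transcripts


if __name__ == "__main__":
    for length in (9, 10, 11):
        # Tag space grows fourfold with each added base.
        print(length, tag_space(length), round(headroom(length, 80_000), 2))
```

A ratio well above 1 only shows that the tag space is large enough in principle; it does not by itself guarantee that every transcript receives a distinct tag.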

Figure 1 shows how these principles were 
implemented for the analysis of mRNA expression.
Double-stranded cDNA was synthesized
from mRNA by means of a biotinylated
oligo(dT) primer. The cDNA was
then cleaved with a restriction endonucle- 
ase (anchoring enzyme) that would be ex- 
pected to cleave most transcripts at least 
once. Typically, restriction endonucleases 
with 4-bp recognition sites were used for
this purpose because they cleave every 256 
bp (4^4) on average, whereas most transcripts
are considerably larger. The most 3'
portion of the cleaved cDNA was then 
isolated by binding to streptavidin beads. 
This process provides a unique site on each 
transcript that corresponds to the restric- 
tion site located closest to the polyadeny- 
late [poly(A)] tail. The cDNA was then 
divided in half and ligated via the anchor- 
ing restriction site to one of two linkers 
containing a type IIS restriction site (tagging
enzyme). Type IIS restriction endonucleases
cleave at a defined distance up to
20 bp away from their asymmetric recogni- 
tion sites (5). The linkers are designed so 
that cleavage of the ligation products with 
the tagging enzyme results in release of the 
linker with a short piece of the cDNA. 

Fig. 1. Schematic of SAGE. The anchoring enzyme is Nla III and the tagging enzyme is Fok I. Sequences 
colored red and green represent primer-derived sequences, whereas blue represents transcript-derived 
sequences, with X and O indicating nucleotides of different tags. See text for further explanation. 

For example, Fig. 1 shows a combination 
of anchoring enzyme and tagging enzyme 
that would yield a 9-bp tag. After blunt 
ends were created, the two pools of released 
tags were ligated to each other. Ligated tags 
then served as templates for polymerase 
chain reaction (PCR) amplification with 
primers specific to each linker. This step 
served several purposes in addition to allow- 
ing amplification of the tag sequences. First, 
it provided for orientation and punctuation 
of the tag sequence in a very compact manner.

The resulting amplification products 
contained two tags (one ditag) linked tail to 
tail, flanked by sites for the anchoring en- 
zyme. In the final sequencing template, this 
resulted in 4 bp of punctuation per ditag. 
Second and most importantly, the analysis 
of ditags, formed before any amplification 
steps, provided a means to completely eliminate
potential distortions introduced by 
PCR. Because the probability of any two 
tags being coupled in the same ditag is 
small, even for abundant transcripts, repeat- 
ed ditags potentially produced by biased 
PCR could be excluded from analysis with- 
out substantially altering the final results. 
Cleavage of the PCR product with the an- 
choring enzyme allowed isolation of ditags 
that could then be concatenated by liga- 
tion, cloned, and sequenced. 
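
The way the anchoring enzyme defines the tag position can be illustrated in silico. The following Python sketch assumes Nla III (recognition sequence CATG) as the anchoring enzyme and a fixed 9-bp tag; `extract_tag` is a hypothetical helper name, and the input is taken to be the sense strand of the cDNA written 5' to 3' without its poly(A) tail.

```python
from typing import Optional

ANCHOR_SITE = "CATG"  # Nla III recognition sequence
TAG_LENGTH = 9        # tag length assumed for this sketch


def extract_tag(cdna: str, anchor: str = ANCHOR_SITE,
                tag_length: int = TAG_LENGTH) -> Optional[str]:
    """Return the tag next to the 3'-most anchoring site, or None.

    None is returned when the transcript has no anchoring site or when
    fewer than `tag_length` bases follow the 3'-most site.
    """
    seq = cdna.upper()
    site = seq.rfind(anchor)  # 3'-most occurrence of the site
    if site == -1:
        return None
    start = site + len(anchor)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


if __name__ == "__main__":
    # Toy sequence with two CATG sites; only the 3'-most one defines the tag.
    print(extract_tag("AAACATGTTTCATGGAGCACACCTTT"))
```

Transcripts for which the sketch returns None correspond to those that would be missed with a given anchoring enzyme.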

As a demonstration of this approach, 
SAGE was used to characterize gene expression
in the human pancreas. We chose Nla III 
as the anchoring enzyme and Bsm FI as the 
tagging enzyme, yielding a 9-bp tag (6). Computer
analysis of human transcripts from GenBank
indicated that greater than 95% of tags 
of this length were likely to be unique and 
that inclusion of two additional bases provided
little additional resolution (7). As outlined 
above, mRNA from human pancreas was used 



Table 1. Pancreatic SAGE tags. Tag indicates the 9-bp sequence identifying each tag, adjacent to the 
4-bp anchoring Nla III site. n and Percent indicate the number of times the tag was identified and its 
frequency, respectively. Gene indicates the description and accession number of the GenBank release 
87 entry found to exactly match the indicated tag when the SAGE software group was used, with the 
following exceptions. When multiple entries were identified because of duplicated entries (7), only one 
entry is listed. For chymotrypsinogen and trypsinogen 1, other genes (adenosine triphosphatase and 
myosin alkali light chain, respectively) were identified that were predicted to contain the same tags, but 
subsequent hybridization and sequence analysis identified the listed genes as the source of the tags. Alu 
entry indicates a match with a GenBank entry for a transcript that contained at least one copy of the Alu 
consensus sequence (16). 



Tag         Gene                                            n   Percent
GAGCACACC   Procarboxypeptidase A1 (X67318)                64     7.6
TTCTGTGTG   Pancreatic trypsinogen 2 (M27602)              46     5.5
GAACACAAA   Chymotrypsinogen (M24400)                      37     4.4
TCAGGGTGA   Pancreatic trypsin 1 (M22612)                  31     3.7
GCGTGACCA   Elastase IIIB (M18692)                         20     2.4
GTGTGTGCT   Protease E (D00306)                            16     1.9
TCATTGGCC   Pancreatic lipase (M93285)                     16     1.9
OCAGAGAGT   Procarboxypeptidase B (M81057)                 14     1.7
TCCTCAAAA   No match (see Table 2, P1)                     14     1.7
AGCCTTGGT   Bile salt stimulated lipase (X54457)           12     1.4
GTGTGCGCT   No match                                       11     1.3
TGCGAGACC   No match (see Table 2, P2)                      9     1.1
GTGAAACCC   21 Alu entries                                  8     1.0
GGTGACTCT   No match                                        8     1.0
AAGGTAACA   Secretory trypsin inhibitor (M11949)            6     0.7
TCCCCTGTG   No match                                        5     0.6
GTGACCACG   No match                                        5     0.6
CCTGTAATC   M91159, M29366, 11 Alu entries                  5     0.6
CACGTTGGA   No match                                        5     0.6
AGCCCTACA   No match                                        5     0.6
AGCACCTCC   Elongation factor 2 (Z11692)                    5     0.6
ACGCAGGGA   No match (see Table 2, P3)                      5     0.6
AATTGAAGA   No match (see Table 2, P4)                      5     0.6
TTCTGTGGG   No match                                        4     0.5
TTCATACAC   No match                                        4     0.5
GTGGCAGGC   NF-kB (X61499), Alu entry (S94541)              4     0.5
GTAAAACCC   TNF receptor II (M55994), Alu entry (X01448)    4     0.5
GAACACACA   No match                                        4     0.5
CCTGGGAAG   Pancreatic mucin (J05582)                       4     0.5
CCCATCGTC   Mitochondrial CytC oxidase (X15759)             4     0.5

SAGE tags occurring:
  Greater than three times                                380    45.2
  Three times (15 x 3 =)                                   45     5.4
  Two times (32 x 2 =)                                     64     7.6
  One time                                                351    41.8
Total SAGE tags                                           840   100.0




to generate ditags (8) that were cloned into a 
plasmid vector (9). Clones containing at least 
10 tags (range 10 to >50) were identified by 
PCR amplification and manually sequenced 
(10). Table 1 shows the analysis of the first 
1000 tags. Sixteen percent were eliminated 
because they either had sequence ambiguities 
or were derived from linker sequences. The 
remaining 840 tags included 351 tags that 
occurred once and 77 tags that were identified 
multiple times (Table 1). Nine of the 10 most 
abundant tags matched at least one entry in 
GenBank release 87 (Table 1). The remain- 
ing tag was subsequently shown to be derived 
from amylase (see below). All 10 transcripts 
were derived from genes of known pancreatic 
function, and their prevalence was consistent 
with previous analyses of pancreatic RNA 
through conventional approaches (11). 
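
The tallying of tags from sequenced clones can be sketched as follows. This Python example is a simplified illustration, not the SAGE software used for Table 1: it assumes that every ditag is exactly 18 bp (two 9-bp tags joined tail to tail), that ditags are separated by the Nla III site CATG, and that no tag itself contains CATG. All function names are hypothetical. Repeated ditags are counted once, following the reasoning given above for excluding potential PCR bias.

```python
from collections import Counter
from typing import Dict, Iterable, List

ANCHOR_SITE = "CATG"  # Nla III recognition sequence
TAG_LENGTH = 9        # tag length assumed for this sketch
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ditags_from_clone(clone_seq: str) -> List[str]:
    """Split a concatemer read on the anchoring site; keep clean 18-bp ditags."""
    pieces = clone_seq.upper().split(ANCHOR_SITE)
    return [p for p in pieces
            if len(p) == 2 * TAG_LENGTH and set(p) <= set("ACGT")]


def count_tags(clone_seqs: Iterable[str]) -> Counter:
    """Count tags over all clones, using each distinct ditag only once."""
    seen = set()
    counts: Counter = Counter()
    for clone in clone_seqs:
        for ditag in ditags_from_clone(clone):
            # The same ditag may be read in either orientation.
            key = min(ditag, reverse_complement(ditag))
            if key in seen:
                continue
            seen.add(key)
            counts[ditag[:TAG_LENGTH]] += 1
            counts[reverse_complement(ditag[TAG_LENGTH:])] += 1
    return counts


def percent_abundance(counts: Counter) -> Dict[str, float]:
    """Convert tag counts to percent of all counted tags."""
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.items()}
```

Pieces with ambiguous bases or unexpected lengths are dropped by the filter, which loosely mirrors the exclusion of ambiguous and linker-derived sequences described above.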

The quantitative nature of SAGE was 
evaluated by construction of an oligo(dT)- 
primed pancreatic cDNA library that was 
screened with cDNA probes for trypsinogen 
1 and 2, procarboxypeptidase A1, chymotrypsinogen,
and elastase IIIB and protease 



Fig. 2. Comparison of transcript abundance. Bars 
represent the percent abundance as determined 
by SAGE (dark bars) or hybridization analysis (light 
bars). SAGE quantitations were derived from Table
1 as follows: TRY1/2 includes the tags for 
trypsinogen 1 and 2; PROCAR indicates tags for 
procarboxypeptidase A1; CHYMO indicates tags 
for chymotrypsinogen; and ELA/PRO includes the 
tags for elastase IIIB and protease E. The cDNA 
hybridizations were as described (12). Error bars 
represent the standard deviation determined by 
taking the square root of counted events and converting
it to a percent abundance. A Poisson distribution
was assumed. 
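
The error-bar calculation described in the legend of Fig. 2 can be written out as a short Python sketch. The function name is illustrative; the example uses the procarboxypeptidase A1 count from Table 1 (64 of 840 tags).

```python
import math
from typing import Tuple


def percent_with_poisson_sd(count: int, total: int) -> Tuple[float, float]:
    """Percent abundance and its standard deviation under Poisson counting.

    The standard deviation of the count is taken as sqrt(count) and is then
    expressed as a percentage of the total number of events.
    """
    percent = 100.0 * count / total
    sd = 100.0 * math.sqrt(count) / total
    return percent, sd


if __name__ == "__main__":
    pct, sd = percent_with_poisson_sd(64, 840)
    print(f"{pct:.1f}% +/- {sd:.1f}%")
```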




Fig. 3. Screening a cDNA library with SAGE tags. 
P1 and P2 show typical hybridization results obtained
with 13-bp oligonucleotides as described 
(13). P1 and P2 correspond to the transcripts described
in Table 2. Images were obtained with a 
Molecular Dynamics PhosphorImager, and the 
circle indicates the outline of the filter membrane 
to which the recombinant phage were transferred 
before hybridization. 
E (12). The relative abundance of the 
SAGE tags for these transcripts was in good 
agreement with the results obtained with 
library screening (Fig. 2). Furthermore, 
whereas neither trypsinogen 1 and 2 nor 
elastase IIIB and protease E could be distinguished
by the cDNA probes used to screen 
the library (12), all four transcripts could 
readily be distinguished on the basis of their 
SAGE tags (Table 1). 

In addition to providing quantitative in- 
formation on the abundance of known tran- 
scripts, SAGE could be used to identify 
novel expressed genes. Although for the 
purposes of SAGE only the 9-bp sequence 
identifying each transcript was considered, 
each SAGE tag defines a 13-bp sequence 
composed of the anchoring enzyme (4-bp) 
site plus the 9-bp tag. As an illustration of 
this potential, 13-bp oligonucleotides were 
used to isolate the transcripts corresponding 
to four unassigned tags (P1 to P4), that is, 
tags without corresponding entries from 
GenBank release 87 (Table 1). In each of 
the four cases, it was possible to isolate 
multiple cDNA clones for the tag by simply 
screening the pancreatic cDNA library with 
the 13-bp oligonucleotide as hybridization 
probe (examples in Fig. 3) (13). In each 
case, sequencing of the derived clones iden- 
tified the correct SAGE tag at the predicted 
3' end of the identified transcript. The 
abundance of plaques identified by hybrid- 
ization with the 13-bp oligonucleotides was 
in good agreement with that predicted by 
SAGE (Table 2). Tags P1 and P2 were 
shown to correspond to amylase and prepro- 
carboxypeptidase A2, respectively. No en- 
try for preprocarboxypeptidase A2 and only 
a truncated entry for amylase was present in 
GenBank release 87, thus accounting for 
their unassigned characterization. Tag P3 
did not match any genes of known function 
in GenBank but did match numerous ESTs, 
providing further evidence that it represented
a real transcript. The cDNA identified
by P4 showed no significant similarities,
suggesting that it represented a previously
uncharacterized pancreatic transcript. 

These results demonstrate that SAGE 
can provide both quantitative and qualitative
data about gene expression. The 
combination of different anchoring en- 
zymes with various recognition sites and 
type IIS enzymes with cleavage sites 5 to 
20 bp from their recognition elements 
lends great flexibility to this strategy. As 
efforts to fully characterize the genome 
near completion, SAGE should allow a 
direct readout of expression in any given 
cell type or tissue. In the interim, we 
envision that the major application of 
SAGE will be the comparison of gene 
expression patterns in various develop- 
mental and disease states. Any laboratory 
with the capability to perform PCR and 
manual sequencing could perform SAGE 
for this purpose. Adaptation of this tech- 
nique to an automated sequencer would 
allow the analysis of over 1000 transcripts 
in a single 3-hour run (14). 

The appropriate number of tags to be 
determined will depend on the applica- 
tion. For example, the definition of genes 
expressed at relatively high levels (0.5% 
or more) in one tissue, but low in another, 
would require only a single day. Determination
of transcripts expressed at greater 
than 100 mRNAs per cell (0.025%) 
should be quantifiable within a few 
months by a single investigator. Use of 
different anchoring enzymes will ensure 
that virtually all transcripts of the desired 
abundance can be identified. The genes 
encoding those tags shown to be most 
interesting on the basis of their differen- 
tial representation can be positively iden- 
tified by a combination of database search- 
ing, hybridization, and sequence analysis 
as demonstrated in Tables 1 and 2. Obvi- 
ously, SAGE could also be applied to the 
analysis of organisms other than humans 
and could direct investigation toward 
genes expressed in specific biologic states. 
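
The relation between transcript abundance and the number of tags to be sequenced can be explored with a simple sampling model. The Python sketch below is not taken from the report: it assumes that each sequenced tag is an independent draw from the mRNA population, and the 95% detection threshold is an arbitrary illustrative choice. It also shows that 100 mRNAs per cell at 0.025% implies roughly 400,000 mRNAs per cell.

```python
import math


def mrnas_per_cell(copies: float, fraction: float) -> float:
    """Total mRNAs per cell implied by a copy number and its fractional abundance."""
    return copies / fraction


def detection_probability(fraction: float, n_tags: int) -> float:
    """Probability of seeing a transcript at least once among n_tags tags."""
    return 1.0 - (1.0 - fraction) ** n_tags


def tags_needed(fraction: float, confidence: float = 0.95) -> int:
    """Smallest number of tags giving the requested detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


if __name__ == "__main__":
    print(round(mrnas_per_cell(100, 0.00025)))  # implied mRNAs per cell
    print(tags_needed(0.005))                   # transcripts at 0.5%
    print(tags_needed(0.00025))                 # transcripts at 0.025%
```

Detecting a transcript once is a weaker requirement than quantifying it, so the numbers from `tags_needed` are best read as lower bounds on the sequencing effort.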



Table 2. Characterizations of unassigned SAGE tags. Tag and SAGE Abundance are as described in 
Table 1; 13-mer hyb. indicates the results obtained by screening a cDNA library with a 13-bp
oligonucleotide (13). The number of positive plaques divided by the total plaques screened is indicated in 
parentheses after the percent abundance. A positive in the SAGE Tag column indicates that the expected 
SAGE tag was identified at the 3' end of isolated clones. Description indicates the results of BLAST 
searches of the daily updated GenBank entries at NCBI (National Center for Biotechnology Information) 
as of 9 June 1995 (16). A description and accession number are given for the most significant matches. 
P1 was found to match a truncated entry for amylase, and P2 was found to match an unpublished entry 
for preprocarboxypeptidase A2 that was entered after GenBank release 87. 



                 Abundance (%)
Tag              SAGE   13-mer hyb.      SAGE tag   Description
P1  TCCTCAAAA    1.7    1.5 (6/388)      +          3' end of pancreatic amylase (M28443)
P2  TGCGAGACC    1.1    1.2 (43/3700)    +          3' end of preprocarboxypeptidase A2 (U19977)
P3  ACGCAGGGA    0.6    0.2 (5/2772)     +          EST match (R45808)
P4  AATTGAAGA    0.6    0.4 (6/1587)     +          No match
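
The hybridization abundances in Table 2 follow directly from the plaque counts, as the short Python sketch below shows. The dictionary simply restates the table; its name and layout are illustrative.

```python
from typing import Dict, Tuple

# tag label -> (SAGE abundance in percent, positive plaques, plaques screened)
TABLE_2: Dict[str, Tuple[float, int, int]] = {
    "P1": (1.7, 6, 388),
    "P2": (1.1, 43, 3700),
    "P3": (0.6, 5, 2772),
    "P4": (0.6, 6, 1587),
}


def hybridization_percent(positive: int, screened: int) -> float:
    """Percent of screened plaques that hybridized to the 13-bp probe."""
    return 100.0 * positive / screened


if __name__ == "__main__":
    for label, (sage_pct, positive, screened) in TABLE_2.items():
        hyb = hybridization_percent(positive, screened)
        print(f"{label}: SAGE {sage_pct:.1f}%  hybridization {hyb:.1f}%")
```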

TECHNICAL COMMENT



The Radius of Gyration of an 
Apomyoglobin Folding Intermediate 



Apomyoglobin (apoMb) forms a stable 
compact partially folded state under acidic 
conditions (1). This "molten globule" intermediate
is slightly expanded relative to 
the native form of the protein, with a 
radius of gyration (R_g) of 23 (± 2) Å 
versus 19 (± 1) Å (2), and shows stable 
secondary structure (3) in the A, G, and H 
helices (Fig. 1). 

We demonstrated recently, with the use 
of stopped-flow circular dichroism and 
pulse-labeling hydrogen exchange measurements,
that the earliest detectable intermediate
(formed within 6 ms) in the apoMb 
kinetic refolding pathway closely resembles 
the equilibrium molten globule state populated
under acid conditions (4). A key question
remained as to how compact this kinetic
intermediate is compared to the equilibrium
and native states. The cooperative 
unfolding of the kinetic intermediate and 
the significant protection from amide proton
exchange (as compared to corresponding
isolated peptides in solution) led us to 
propose that the kinetic intermediate is also 
compact (4, 5). Such a proposal could best 
be verified by direct determination of the 
size of the protein as it folds, but measurements
of this nature were not feasible at the 
time. 

Fig. 1. Sketch of the structure of holo-myoglobin, 
illustrating the location of the A, G, and H helices, 
which are present in both the equilibrium and kinetic
folding intermediates of the apoprotein. 

Newly developed improvements in time-resolved
small angle x-ray scattering 
(SAXS) experiments allow direct measurement
of the time-dependent change of R_g of 
a protein as it folds in the millisecond to 
second time frame (6, 7). We initiated studies
of the refolding of apoMb using this 
technique, under conditions similar to 
those employed in our previous work (4). 
SAXS data collected during the first 100 ms 
after initiation of the refolding reaction (8) 
are shown in Fig. 2. 

Data collected from the fully refolded 
protein and unfolded protein are given for 
comparison (Fig. 2). The data obtained 100 
ms after the initiation of folding are within 
experimental error of the data obtained for 
the refolded protein, and easily distinguishable
from data obtained for the unfolded 
state. An R_g value of 23 (± 2) Å is obtained 
at 100 ms, only 1 Å greater than the 22 (± 
1) Å value obtained for the refolded protein.
By contrast, the unfolded state has an 
R_g of 34 (± 2) Å. The slightly higher than 
expected R_g value obtained for the refolded 
state may result from either experimental 
error (9) or a small degree of sample aggregation
owing to radiation damage during 
exposure. It is possible that the R_g value 
obtained at 100 ms is similarly inflated, and 
it may therefore be considered an upper 
bound on the true R_g. 
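
The text does not state how R_g was extracted from the scattering curves. A common approach for SAXS data is the Guinier approximation, ln I(K) = ln I(0) - (R_g^2 / 3) K^2, which holds at small scattering angles (roughly K R_g < 1.3). The Python sketch below assumes that approach and assumes K = 4 pi sin(theta) / lambda; the function name and the synthetic data are illustrative only and are not the measurements shown in Fig. 2.

```python
import math
from typing import Sequence


def guinier_rg(k: Sequence[float], intensity: Sequence[float]) -> float:
    """Estimate R_g from a straight-line fit of ln I against K^2.

    Under the Guinier approximation the slope of the line is -R_g^2 / 3.
    Units of R_g are the inverse of the units of K.
    """
    x = [ki * ki for ki in k]
    y = [math.log(i) for i in intensity]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("intensity does not decay with K; Guinier fit is invalid")
    return math.sqrt(-3.0 * slope)


if __name__ == "__main__":
    # Noise-free synthetic curve for a particle with R_g = 23 (K in 1/Angstrom).
    ks = [0.005 * i for i in range(1, 11)]
    curve = [1000.0 * math.exp(-((ki * 23.0) ** 2) / 3.0) for ki in ks]
    print(round(guinier_rg(ks, curve), 2))
```

With real data, aggregation of the kind mentioned above tends to raise the low-angle intensity and therefore the fitted R_g, which is consistent with treating the measured value as an upper bound.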

Our conclusion that the intermediate is 
compact is based on the small differences 




Fig. 2. SAXS data from sperm whale apomyoglobin after 100 ms of folding, after 4.2 s of folding, 
and in the unfolded state. Detected intensity is plotted as a function of K. Data from the unfolded 
state are scaled to match the folded state data at zero scattering angle. The data obtained from the 
fully folded protein and those obtained after 100 ms of folding are barely distinguishable from each 
other and are different from the data for the unfolded protein. 